Method and apparatus for providing direct access to unique hierarchical data items

ABSTRACT

A computer implemented method, data processing system, and computer usable program code are provided for accessing unique hierarchical data. A tree structure for a document is analyzed. A determination is made as to whether a set of unique paths exist in the tree structure. Responsive to an existence of the set of unique paths, a unique path identifier is assigned to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs. Then, the unique path identifier and a node address for the unique hierarchical data for each of the set of unique path identifiers and assigned unique path pairs is stored into a header in the document disk page.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to databases. More specifically, the present invention relates to a computer implemented method, apparatus, and computer usable program code for accessing hierarchical data items.

2. Description of the Related Art

Structured documents are documents which have nested structures. Documents written in Extensible Markup Language (XML) are structured documents. XML is quickly becoming the standard format for delivering information on the World Wide Web because this format allows a user to design a customized markup language for many classes of structured documents. XML supports user-defined tags for better description of nested document structures and associated semantics, and encourages separation of document contents from browser presentation. XML documents have a hierarchical structure and can conceptually be interpreted as a tree structure, called an XML tree.

As more and more businesses present and exchange data in XML documents, the challenge is to store, search, and retrieve these documents using existing relational database systems. A relational database management system (RDBMS) is a database management system which uses relational techniques for storing and retrieving data. Relational databases are organized into tables, which consist of rows and columns of data. A database will typically have many tables, and each table will typically have multiple rows and columns. The tables are typically stored on direct access storage devices (DASD), such as magnetic or optical disk drives for semi-permanent storage.

Most web applications have connections to databases and use XML to transfer data from the database to the web application and vice versa. Every major database vendor has proprietary extensions for using XML with relational databases, but they take completely different approaches, and there is no interoperability between them.

Current relational database systems have evolved into hybrid systems that store both relational data and XML data. In fact, in more recent versions of International Business Machines' DB2® Database, XML was introduced as a data type. SQL/XML and XQuery are new query languages for use with the XML data type.

XQuery and SQL/XML are two standards that use declarative, portable queries to return XML by querying data. In both standards, the XML can have any desired structure, and the queries can be arbitrarily complex. XQuery is XML-centric, while SQL/XML is SQL-centric. SQL/XML is an extension of SQL that is part of ANSI/ISO SQL 2003. SQL/XML lets SQL queries create XML structures with a few powerful XML publishing functions.

Execution of queries on XML often involves retrieving specific nodes from an XML tree by navigating the XML hierarchy following a given path. However, one problem with navigation is that it incurs a significant computational overhead as addresses of multiple nodes are computed and de-referenced.

SUMMARY OF THE INVENTION

The different illustrative embodiments provide a computer implemented method, data processing system, and computer usable program code for accessing unique hierarchical data. The illustrative embodiments analyze a tree structure for a document. The illustrative embodiments determine whether a set of unique paths exist in the tree structure. The illustrative embodiments assign a unique path identifier to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs in response to an existence of the set of unique paths. The illustrative embodiments store the unique path identifier and a node address for the unique hierarchical data for each of the set of unique path identifiers and assigned unique path pairs into a header in the document disk page.

In another illustrative embodiment for accessing data, the illustrative embodiments receive a query request for particular data. Then, in response to receiving the query request, the illustrative embodiments determine whether a pointer to the particular data is found in a data structure containing pointers to a plurality of nodes in a hierarchical structure in which the plurality of nodes are referenced by unique paths. In this illustrative embodiment, the nodes contain data.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is a pictorial representation of a network of data processing systems in which the exemplary embodiments may be implemented;

FIG. 2 is a block diagram of a data processing system in which the exemplary embodiments may be implemented;

FIG. 3 depicts an exemplary XML tree in accordance with an illustrative embodiment;

FIG. 4 depicts a path table associating unique path expressions with unique numerical path identifiers in accordance with an illustrative embodiment;

FIG. 5 depicts the layout of a header to be stored in a document disk page containing XML trees in accordance with an illustrative embodiment;

FIG. 6 depicts a flowchart for creating a header in a document for accessing unique hierarchical data items using path identifiers in accordance with an illustrative embodiment; and

FIG. 7 depicts a flowchart for the operation of accessing unique hierarchical data items using path identifiers in the header of a document in accordance with an illustrative embodiment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The illustrative embodiments provide for accessing unique hierarchical data items using path identifiers in the header of a document. FIGS. 1-2 are provided as exemplary diagrams of data processing environments in which embodiments may be implemented. It should be appreciated that FIGS. 1-2 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope.

With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which the exemplary embodiments may be implemented. Network data processing system 100 is a network of computers in which embodiments may be implemented. Network data processing system 100 contains network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.

In the depicted example, server 104 and server 106 connect to network 102 along with storage unit 108. In addition, clients 110, 112, and 114 connect to network 102. These clients 110, 112, and 114 may be, for example, personal computers or network computers. In the depicted example, server 104 provides data, such as boot files, operating system images, and applications to clients 110, 112, and 114. Clients 110, 112, and 114 are clients to server 104 in this example. Network data processing system 100 may include additional servers, clients, and other devices not shown.

In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for different embodiments.

With reference now to FIG. 2, a block diagram of a data processing system is shown in which the exemplary embodiments may be implemented. Data processing system 200 is an example of a computer, such as server 104 or client 110 in FIG. 1, in which computer usable code or instructions implementing the processes for embodiments may be located.

In the depicted example, data processing system 200 employs a hub architecture including north bridge and memory controller hub (NB/MCH) 202 and south bridge and input/output (I/O) controller hub (ICH) 204. Processing unit 206, main memory 208, and graphics processor 210 are connected to north bridge and memory controller hub 202. Graphics processor 210 may be connected to north bridge and memory controller hub 202 through an accelerated graphics port (AGP).

In the depicted example, local area network (LAN) adapter 212 connects to south bridge and I/O controller hub 204. Audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, hard disk drive (HDD) 226, CD-ROM drive 230, universal serial bus (USB) ports and other communications ports 232, and PCI/PCIe devices 234 connect to south bridge and I/O controller hub 204 through bus 238 and bus 240. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash binary input/output system (BIOS).

Hard disk drive 226 and CD-ROM drive 230 connect to south bridge and I/O controller hub 204 through bus 240. Hard disk drive 226 and CD-ROM drive 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. Super I/O (SIO) device 236 may be connected to south bridge and I/O controller hub 204.

An operating system runs on processing unit 206 and coordinates and provides control of various components within data processing system 200 in FIG. 2. As a client, the operating system may be a commercially available operating system such as Microsoft® Windows® XP (Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both). An object-oriented programming system, such as the Java programming system, may run in conjunction with the operating system and provides calls to the operating system from Java programs or applications executing on data processing system 200 (Java is a trademark of Sun Microsystems, Inc. in the United States, other countries, or both).

As a server, data processing system 200 may be, for example, an IBM eServer™ pSeries® computer system, running the Advanced Interactive Executive (AIX®) operating system or Linux® operating system (eServer, pSeries and AIX are trademarks of International Business Machines Corporation in the United States, other countries, or both while Linux is a trademark of Linus Torvalds in the United States, other countries, or both). Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors in processing unit 206. Alternatively, a single processor system may be employed.

Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226, and may be loaded into main memory 208 for execution by processing unit 206. The processes for embodiments are performed by processing unit 206 using computer usable program code, which may be located in a memory such as, for example, main memory 208, read only memory 224, or in one or more peripheral devices 226 and 230.

Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIGS. 1-2. Also, the processes may be applied to a multiprocessor data processing system.

In some illustrative examples, data processing system 200 may be a personal digital assistant (PDA), which is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data.

A bus system may be comprised of one or more buses, such as bus 238 or bus 240 as shown in FIG. 2. Of course the bus system may be implemented using any type of communications fabric or architecture that provides for a transfer of data between different components or devices attached to the fabric or architecture. A communications unit may include one or more devices used to transmit and receive data, such as modem 222 or network adapter 212 of FIG. 2. A memory may be, for example, main memory 208, read only memory 224, or a cache such as found in north bridge and memory controller hub 202 in FIG. 2. The depicted examples in FIGS. 1-2 and above-described examples are not meant to imply architectural limitations. For example, data processing system 200 also may be a tablet computer, laptop computer, or telephone device in addition to taking the form of a PDA.

Hierarchical data, such as XML, is natively stored in a database as a tree. The nodes in this tree represent data items and the edges represent containment. Edges are stored as pointers inside nodes, such as a child pointer array or a parent pointer. Queries for specific data items in a tree often use a path pattern specification, such as XPath, that indicates the position of the data item in the tree, relative to the root of the tree. In order to retrieve the data item indicated by a path, a database engine performs the navigation steps specified by the path starting from the root. However, performing these navigation steps incurs a significant computational overhead, because each path specified in a query needs to be traversed, often for a large number of documents. Thus, the illustrative embodiments store inside each document disk page a header that contains an array associating each uniquely occurring path pattern with the address of the node reachable through that path. A document disk page may also be referred to as a page cache or disk cache. A document disk page is a transparent cache of disk-backed pages kept in primary storage for quicker access.

FIG. 3 depicts an exemplary XML tree in accordance with an illustrative embodiment. XML tree 300 contains internal nodes 302 that represent XML elements and leaf nodes 304 that represent data, such as text content. A typical XML query specifies one or more nodes to be retrieved from a document by means of path expressions, which may be expressed using the XPath language. For example, the path expression /PurchaseOrder/Seller/Name specifies node 306. Some path expressions uniquely specify a node, such as node 306 or node 312, while other path expressions specify a plurality of nodes. For example, the path expression /PurchaseOrder/LineItems/Item/Name is matched by nodes 308 and 310 in XML tree 300. The illustrative embodiments are directed only to path expressions that uniquely specify a node in a document, such as node 306 or node 312. The information about the uniqueness of nodes specified by a path expression may be obtained from the document schema or, if a schema is not provided, directly from the document instance.
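As a minimal sketch of obtaining uniqueness information directly from a document instance, the following Python counts how many elements match each root-to-node path and keeps the paths matched exactly once. The sample document, its text values, and the helper names `iter_paths` and `unique_paths` are assumptions made for illustration; they are not taken from the figures.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document shaped like the purchase order example;
# the text values are invented for illustration.
SAMPLE = """<PurchaseOrder>
  <Seller><Name>Acme</Name></Seller>
  <Buyer><Name>Bob</Name></Buyer>
  <LineItems>
    <Item><Name>Widget</Name></Item>
    <Item><Name>Gadget</Name></Item>
  </LineItems>
</PurchaseOrder>"""


def iter_paths(elem, prefix=""):
    """Yield (path expression, element) for elem and all its descendants."""
    path = f"{prefix}/{elem.tag}"
    yield path, elem
    for child in elem:
        yield from iter_paths(child, path)


def unique_paths(root):
    """Return the path expressions matched by exactly one element."""
    counts = Counter(path for path, _ in iter_paths(root))
    return sorted(path for path, n in counts.items() if n == 1)


if __name__ == "__main__":
    for path in unique_paths(ET.fromstring(SAMPLE)):
        print(path)
```

In this sketch /PurchaseOrder/Seller/Name and /PurchaseOrder/Buyer/Name are reported as unique, while /PurchaseOrder/LineItems/Item/Name is excluded because two elements match it.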

FIG. 4 depicts a path table associating unique path expressions with unique numerical path identifiers in accordance with an illustrative embodiment. Path table 400 identifies path expression 402 and path identifier 404 for a number of entries, such as entries 406 and 408. Entry 406 indicates path expression 402 as being /PurchaseOrder/Seller/Name, which is the same as the path expression for node 306 in FIG. 3, and indicates path identifier 404 to be an exemplary “3783”. Entry 408 indicates path expression 402 as being /PurchaseOrder/Buyer/Name, which is the same as the path expression for node 312 in FIG. 3, and indicates path identifier 404 to be an exemplary “3362”. Path table 400 may be external to the document disk page and used by a database management system (DBMS) in order to reduce the space and time required for matching path expressions at query evaluation time.
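One way to picture such a path table is as a two-way mapping between path expressions and numeric identifiers. The sketch below is illustrative only: the class name `PathTable`, its methods, and the sequential numbering are assumptions, since the description does not say how identifiers such as the exemplary “3783” are chosen.

```python
from itertools import count


class PathTable:
    """Hypothetical in-memory path table: path expression <-> path identifier."""

    def __init__(self, first_id=1):
        # Sequential identifiers are an assumption of this sketch.
        self._next_id = count(first_id)
        self._id_by_path = {}
        self._path_by_id = {}

    def register(self, path):
        """Return the identifier for path, assigning a new one if needed."""
        if path not in self._id_by_path:
            path_id = next(self._next_id)
            self._id_by_path[path] = path_id
            self._path_by_id[path_id] = path
        return self._id_by_path[path]

    def id_for(self, path):
        """Look up the identifier for a path expression (None if absent)."""
        return self._id_by_path.get(path)

    def path_for(self, path_id):
        """Look up the path expression for an identifier (None if absent)."""
        return self._path_by_id.get(path_id)


if __name__ == "__main__":
    table = PathTable()
    seller = table.register("/PurchaseOrder/Seller/Name")
    buyer = table.register("/PurchaseOrder/Buyer/Name")
    print(seller, table.path_for(seller))
    print(buyer, table.path_for(buyer))
```

Matching a short integer instead of a full path string is what lets the per-page header stay compact, which is the space and time saving the paragraph above refers to.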

FIG. 5 depicts the layout of a header to be stored in a document disk page containing XML trees in accordance with an illustrative embodiment. In this exemplary embodiment, header 502 is stored within document disk page 504. Header 502 contains entries 506 and 508 which identify an association between uniquely occurring path identifier 510 and node address 512 and path identifier 514 and node address 516, respectively. Thus, for example, entry 506 contains path identifier 510 corresponding to the path expression /PurchaseOrder/Seller/Name, as shown in path table 400 of FIG. 4, and node address 512 contains the address of the corresponding node.
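The header can be thought of as a small array of (path identifier, node address) entries at the start of the page. The following sketch packs and reads such an array with Python's `struct` module. The binary layout (a 2-byte entry count followed by pairs of 4-byte little-endian integers), the function names, and the sample node addresses are all assumptions for illustration; the description does not specify field widths or byte order.

```python
import struct

# Assumed layout: <count:uint16> then count * (<path_id:uint32> <node_addr:uint32>)
COUNT = struct.Struct("<H")
ENTRY = struct.Struct("<II")


def pack_header(entries):
    """Serialize (path_id, node_address) pairs into header bytes."""
    body = b"".join(ENTRY.pack(pid, addr) for pid, addr in sorted(entries))
    return COUNT.pack(len(entries)) + body


def unpack_header(page):
    """Read the (path_id, node_address) pairs from the start of a page."""
    (n,) = COUNT.unpack_from(page, 0)
    return [ENTRY.unpack_from(page, COUNT.size + i * ENTRY.size) for i in range(n)]


def lookup(page, path_id):
    """Return the node address stored for path_id, or None if absent."""
    for pid, addr in unpack_header(page):
        if pid == path_id:
            return addr
    return None


if __name__ == "__main__":
    # 3783 and 3362 are the exemplary identifiers; 64 and 156 are made-up offsets.
    page = pack_header([(3783, 64), (3362, 156)]) + b"...tree nodes follow..."
    print(lookup(page, 3783))
    print(lookup(page, 1234))
```

Because the entries are written in identifier order, a binary search could replace the linear scan in `lookup` when a header holds many entries.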

In retrieving the elements associated with the document, the processor, such as processing unit 206 of FIG. 2, analyzes document disk page 504 to determine if header 502 is present. If header 502 is present, the processor initiates a query that analyzes header 502 to identify all of the path identifiers, such as path identifiers 510 and 514, and references a path table to retrieve the path expression for each path identifier. Using the retrieved path expression and the node address, such as node address 512 or 516, the query accesses the data at the node address.

FIG. 6 depicts a flowchart for creating a header in a document for accessing unique hierarchical data items using path identifiers in accordance with an illustrative embodiment. As the operation begins, the processor analyzes a tree structure, such as XML tree 300 of FIG. 3, for a document (step 602). The processor then determines if at least one unique path exists in the tree structure (step 604). If at step 604, no unique path exists in the tree structure, then the operation ends. If at step 604, at least one unique path does exist, then the processor assigns a unique path identifier to each unique path (step 606). Then, the processor loads the unique path identifier and unique path pair into a path table, such as path table 400 of FIG. 4 (step 608). The processor then creates a header, such as header 502 of FIG. 5, in the document disk page (step 610) and stores the unique path identifier and node address for the unique path pair into the header (step 612), with the operation terminating thereafter.
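The header-creation flow of steps 602 through 612 can be sketched as follows. This is a simplified illustration under stated assumptions: the path table is a plain dictionary, identifiers are assigned sequentially, and a node's "address" is simply its index in document order, standing in for a real address inside a disk page. The function name `build_header` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_header(xml_text, path_table):
    """Return {path_id: node_address} for the unique paths, or None if there are none.

    path_table is a dict {path expression: path_id} that is updated in place.
    """
    root = ET.fromstring(xml_text)            # step 602: analyze the tree
    nodes = []                                # (path, address) in document order

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        nodes.append((path, len(nodes)))      # index stands in for a node address
        for child in elem:
            walk(child, path)

    walk(root, "")
    counts = Counter(path for path, _ in nodes)
    unique = [(p, a) for p, a in nodes if counts[p] == 1]
    if not unique:                            # step 604: no unique path, stop
        return None                           # (not reached here: the root path is always unique)

    header = {}                               # step 610: create the header
    for path, addr in unique:
        # steps 606 and 608: assign an identifier and record it in the path table
        path_id = path_table.setdefault(path, len(path_table) + 1)
        header[path_id] = addr                # step 612: store (identifier, address)
    return header


if __name__ == "__main__":
    doc = ("<PurchaseOrder><Seller><Name>Acme</Name></Seller>"
           "<Buyer><Name>Bob</Name></Buyer>"
           "<LineItems><Item><Name>A</Name></Item><Item><Name>B</Name></Item></LineItems>"
           "</PurchaseOrder>")
    table = {}
    print(build_header(doc, table))
    print(table)
```

Reusing one `path_table` across documents means the same path expression receives the same identifier in every header, which matches the role of a path table kept outside the document disk page.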

FIG. 7 depicts a flowchart for the operation of accessing unique hierarchical data items using path identifiers in the header of a document in accordance with an illustrative embodiment. As the operation begins, the processor receives a request to display a set of elements from a document, specified using path expressions (step 702). A set of elements may be one element or a plurality of elements. The processor then determines if the document includes one or more elements that need to be retrieved (step 704). If at step 704, the document does include elements that need to be retrieved, the processor initiates a query to determine if a header, such as header 502 of FIG. 5, is present within the document disk page (step 706). If at step 706, a header is present within the document disk page, the query analyzes the document to determine if the header includes one or more of the requested path identifiers (step 708).

If at step 708, the header includes one or more path identifiers, the query retrieves the path expression corresponding to each path identifier (step 710). Using the path expression and the node address associated with the path identifier in the header, the query then retrieves the data at the node address (step 712). For the path identifiers which are not found in the header, the query traverses the tree according to the path and retrieves the data at the node address at the end of the traversal. The processor then displays the document using the retrieved data (step 714), with the operation terminating thereafter.

Returning to step 704, if the document does not include elements that need to be retrieved, the processor then displays the document using the retrieved data (step 714), with the operation terminating thereafter. Returning to step 706, if a header is not present within the document disk page, the query traverses the tree according to the tree path to the node address (step 716), with the operation proceeding to step 712 thereafter. Returning to step 708, if the header does not include any path identifiers, the query traverses the tree according to the tree path to the node address (step 716), with the operation proceeding to step 712 thereafter.
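The retrieval decisions of steps 706 through 716 can be sketched as a lookup with a traversal fallback. In this hedged illustration the header is a dictionary from path identifier to node address, the path table is a dictionary from path expression to identifier, and node addresses are indexes into a document-order list; all of these representations, and the function names `navigate` and `fetch`, are assumptions rather than the described implementation.

```python
import xml.etree.ElementTree as ET


def navigate(root, path):
    """Step 716: walk the tree from the root, one path step at a time."""
    steps = path.strip("/").split("/")
    if root.tag != steps[0]:
        return []
    current = [root]
    for tag in steps[1:]:
        current = [child for node in current for child in node if child.tag == tag]
    return current


def fetch(root, nodes, header, path_table, path):
    """Return (matching elements, how they were found) for one path expression."""
    path_id = path_table.get(path)
    # steps 706 and 708: is there a header, and does it list this identifier?
    if header is not None and path_id in header:
        return [nodes[header[path_id]]], "header"   # step 712: direct access
    return navigate(root, path), "traversal"        # step 716, then step 712


if __name__ == "__main__":
    root = ET.fromstring(
        "<PurchaseOrder><Seller><Name>Acme</Name></Seller>"
        "<Buyer><Name>Bob</Name></Buyer>"
        "<LineItems><Item><Name>A</Name></Item><Item><Name>B</Name></Item></LineItems>"
        "</PurchaseOrder>")
    nodes = list(root.iter())          # document order; index = stand-in address
    path_table = {"/PurchaseOrder/Seller/Name": 3783, "/PurchaseOrder/Buyer/Name": 3362}
    header = {3783: 2, 3362: 4}        # made-up addresses for this sample
    for p in ("/PurchaseOrder/Seller/Name", "/PurchaseOrder/LineItems/Item/Name"):
        found, how = fetch(root, nodes, header, path_table, p)
        print(p, [e.text for e in found], how)
```

The direct branch touches one node regardless of path depth, whereas `navigate` visits every level of the path; that difference is the navigation overhead the header is meant to avoid.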

Thus, the illustrative embodiments access unique hierarchical data items using path identifiers in the header of a document. In one embodiment, a query request is received for particular data and, responsive to receiving the query request, a determination is made as to whether a pointer to the particular data is found in a data structure containing pointers to a plurality of nodes in a hierarchical structure in which the plurality of nodes are referenced by unique paths. In this embodiment, the nodes contain data. In another embodiment, a tree structure for a document is analyzed. A determination is made as to whether a set of unique paths exist in the tree structure. Responsive to an existence of the set of unique paths, a unique path identifier is assigned to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs. The unique path identifier and a node address for the unique hierarchical data for each of the set of unique path identifiers and assigned unique path pairs is stored into a header in the document disk page.

The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

1. A computer implemented method for accessing data, the computer implemented method comprising: receiving a query request for particular data; and responsive to receiving the query request, determining whether a pointer to the particular data is found in a data structure containing pointers to a plurality of nodes in a hierarchical structure in which the plurality of nodes are referenced by unique paths, wherein the plurality of nodes contain the data.
 2. The computer implemented method of claim 1, further comprising: responsive to an absence of the pointer in the pointers in the data structure, traversing the hierarchical structure to identify a node containing the particular data in the hierarchical structure.
 3. The computer implemented method of claim 1, wherein the data structure is a header.
 4. A computer implemented method for accessing unique hierarchical data, the computer implemented method comprising: analyzing a tree structure for a document; determining whether a set of unique paths exist in the tree structure; responsive to an existence of the set of unique paths, assigning a unique path identifier to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs; and storing the unique path identifier and a node address for the unique hierarchical data for each of the set of unique path identifiers and the assigned unique path pairs into a header in a document disk page.
 5. The computer implemented method of claim 4, further comprising: receiving a request to display a set of elements from a document, specified using path expressions; determining if the document includes the hierarchical data that needs to be retrieved; responsive to the document including hierarchical data that needs to be retrieved, determining if the header is present in the document disk page; and responsive to a presence of the header in the document disk page, retrieving the set of unique paths specified by each unique path identifier stored in the header.
 6. The computer implemented method of claim 5, further comprising: retrieving the unique hierarchical data associated with the set of unique paths at the node address for the unique hierarchical data.
 7. The computer implemented method of claim 6, further comprising: displaying the document with the unique hierarchical data.
 8. The computer implemented method of claim 5, further comprising: responsive to an absence of the header in the document disk page, traversing the tree structure to the node address to retrieve the unique hierarchical data.
 9. The computer implemented method of claim 4, further comprising: loading the set of unique path identifiers and the assigned unique path pairs into a path table.
 10. The computer implemented method of claim 4, further comprising: creating the header in the document disk page associated with the document.
 11. The computer implemented method of claim 4, further comprising: responsive to an absence of the set of unique paths, displaying the document with the unique hierarchical data.
 12. The computer implemented method of claim 4, wherein the tree structure is an extensible markup language tree structure.
 13. A data processing system comprising: a bus system; a communications system connected to the bus system; a memory connected to the bus system, wherein the memory includes a set of instructions; and a processing unit connected to the bus system, wherein the processing unit executes the set of instructions to analyze a tree structure for a document; determine whether a set of unique paths exist in the tree structure; assign a unique path identifier to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs in response to an existence of the set of unique paths; and store the unique path identifier and a node address for unique hierarchical data for each of the set of unique path identifiers and the assigned unique path pairs into a header in a document disk page.
 14. The data processing system of claim 13, wherein the processing unit executes the set of instructions to receive a request to display a set of elements from the document, specified using path expressions; determine if the document includes hierarchical data that needs to be retrieved; determine if the header is present in the document disk page in response to the document including the hierarchical data that needs to be retrieved; and retrieve the set of unique paths specified by each unique path identifier stored in the header in response to a presence of the header in the document disk page.
 15. The data processing system of claim 14, wherein the processing unit executes the set of instructions to retrieve the unique hierarchical data associated with the set of unique paths at the node address for the unique hierarchical data.
 16. The data processing system of claim 15, wherein the processing unit executes the set of instructions to display the document with the unique hierarchical data.
 17. A computer program product comprising: a computer usable medium including computer usable program code for accessing unique hierarchical data, the computer program product including: computer usable program code for analyzing a tree structure for a document; computer usable program code for determining whether a set of unique paths exist in the tree structure; computer usable program code for assigning a unique path identifier to each of the set of unique paths to create a set of unique path identifiers and assigned unique path pairs in response to an existence of the set of unique paths; and computer usable program code for storing the unique path identifier and a node address for the unique hierarchical data for each of the set of unique path identifiers and the assigned unique path pairs into a header in a document disk page.
 18. The computer program product of claim 17, further including: computer usable program code for receiving a request to display a set of elements from the document, specified using path expressions; computer usable program code for determining if the document includes hierarchical data that needs to be retrieved; computer usable program code for determining if the header is present in the document disk page in response to the document including the hierarchical data that needs to be retrieved; and computer usable program code for retrieving the set of unique paths specified by each unique path identifier stored in the header in response to a presence of the header in the document disk page.
 19. The computer program product of claim 18, further including: computer usable program code for retrieving the unique hierarchical data associated with the set of unique paths at the node address for the unique hierarchical data.
 20. The computer program product of claim 19, further including: computer usable program code for displaying the document with the unique hierarchical data.